Team Project Proposal

Authors:

For this document, the following packages are required:

library(knitr)
library(readxl)
suppressPackageStartupMessages(library(tidyverse))
library(dplyr)
library(here)
Warning: package 'here' was built under R version 4.4.1
here() starts at C:/Users/aloys/OneDrive/Documents/Year 1 Tri 3/AAI 1001 Data Engineering and Visualization/DEnVGrp3Proj

1 Original Data Visualization in News Media

The geographical distribution of Singapore’s population is a focal point of urban studies and public policy discussions. A visualization of demographic trends across planning areas offers a useful perspective on this discourse. Our project aims to examine the relationship between demographic characteristics and urban planning policies. The patterns seen in Figure 1 are not random and may be consistent with hypotheses about the impact of strategic urban development on population distribution.

Our visualization, inspired by the seminal work of the Singapore Department of Statistics (2023), encapsulates data from the year 2023, a period marked by significant urban and demographic changes. While our initial rendition has been commended for its clarity in showcasing trends, it is the addition of interactive elements that promises a deeper engagement with the data. Nonetheless, we acknowledge room for improvement. Incorporating interactive toggles, expanded temporal ranges, and detailed geospatial mappings will refine our approach, providing a more comprehensive exploration of how urban planning influences demographic distribution and housing patterns in Singapore.

Figure 1: Choropleth map of resident population density by the Singapore Department of Statistics (2023)

2 Critical Assessment of the Original Visualization

The selected visualization from the Singapore Department of Statistics presents two variables: population distribution (quantitative) and planning areas (categorical). Additionally, the visualization includes a heatmap that allows users to drill down into subzones to see patterns in land development and population over time.

However, there are some shortcomings the team has identified:

  1. Complexity: While the visualization is thorough, its complexity may be overwhelming for some users, especially those unfamiliar with demographic data or geographical information systems (GIS).

  2. Accessibility: The reliance on color and detailed graphics may pose accessibility challenges for users with visual impairments or those who are not tech-savvy.

  3. Data Density: The high density of information in a single visualization can lead to cognitive overload, where important details might be overlooked due to the sheer volume of data presented (Arnold, Goldschmitt, and Rigotti 2023).

  4. Limited Temporal Range: The visualization only covers data for the year 2023, which constrains the analysis to a narrow timeframe and does not allow for the examination of trends over time.

  5. Lack of Customization: Users cannot customize the time range or select specific years of interest, limiting the depth of their analysis.

  6. Static Elements: While some elements are interactive, the visualization could benefit from more dynamic features such as changing bubble sizes or color gradients to depict demographic shifts over time.

3 Proposed Improvements

We propose to address the shortcomings of the original visualization as follows:

  1. Better Contrast: Use high-contrast colors to improve accessibility for users with visual impairments, ensuring clarity and ease of interpretation for all users. One such example is the Color Universal Design (CUD) palette, whose colors are designed to be distinguishable by all users, including those with color vision deficiencies (Okabe and Ito 2008).

  2. Reduced Data Density: Simplify the presentation by reducing the number of data points displayed simultaneously, thus preventing overcrowding and making the visualization more comprehensible.

  3. Interactive Elements: Hovering over a planning area will display a tooltip with the population size of that area of Singapore as well as its age profile.

  4. Expanded Temporal Ranges: Introduce options for users to select specific time periods for analysis, facilitating a deeper exploration of trends over time.
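
As a rough illustration of the first improvement, the Okabe-Ito (CUD) colors are available in base R through palette.colors() (R 4.0.0 or later). The sketch below assigns them to age group labels. The helper name age_group_colours and the choice to color age groups are assumptions made for illustration, not part of the original visualization.

```r
# Okabe-Ito colours ship with base R (grDevices, R >= 4.0.0).
# unname() drops the colour names so that functions such as ggplot2's
# scale_fill_manual() do not try to match them against data values.
okabe_ito <- unname(palette.colors(n = 8, palette = "Okabe-Ito"))

# Hypothetical helper: map each distinct group label to one CUD colour.
age_group_colours <- function(groups, palette = okabe_ito) {
  groups <- unique(groups)
  if (length(groups) > length(palette)) {
    stop("More groups than colours; merge groups or use a sequential palette.")
  }
  setNames(palette[seq_along(groups)], groups)
}

# Example call with three age group labels
age_group_colours(c("0 to 9", "10 to 19", "20 to 29"))
```

The eight colors do not cover all ten-year age groups at once, so the full data may call for merged groups or a sequential palette instead.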

4 Data Cleaning

The Singapore Department of Statistics based its visualization on data collected by the Singapore Government for the year 2023, available in CSV format. The data includes the following columns: Planning Area, Population, Age Profile, and Gender (2023). For our improved visualization, we will use datasets covering 2000 to 2023. The Department of Statistics groups the data into files spanning roughly ten years each (e.g. 2000-2010, 2011-2020), and years that do not yet fill such a range are provided as separate files. We therefore first combine the files into a single data frame and then clean it.

# Set the working directory to the root of the project folder
setwd(here::here())

# List all CSV files in the data folder (pattern is a regular expression, not a glob)
file_list <- list.files("data", pattern = "\\.csv$")

# Print the list of files
print(file_list)
[1] "respopagesex2000to2010.csv" "respopagesex2011to2020.csv"
[3] "respopagesex2021.csv"       "respopagesex2022.csv"      
[5] "respopagesex2023.csv"      
# Read and combine all CSV files
csv_list <- lapply(file_list, function(x) read.csv(file.path("data", x)))
df <- do.call(rbind, csv_list)

# Save the combined data frame to a new CSV file (optional if you want to view it before we start cleaning) 
# write.csv(df, file.path("data", "combined_data.csv"), row.names = FALSE)
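
rbind() stops with an error when the files do not share the same column names, so it can help to check this before combining. A minimal sketch of such a check follows, using two temporary files with small stand-in rows; the helper read_and_combine is hypothetical and not part of the project code.

```r
# Hypothetical helper: read several CSV files and combine them only if
# every file has the same column names in the same order.
read_and_combine <- function(paths) {
  tables <- lapply(paths, read.csv, stringsAsFactors = FALSE)
  same_cols <- vapply(
    tables,
    function(tbl) identical(names(tbl), names(tables[[1]])),
    logical(1)
  )
  if (!all(same_cols)) {
    stop("Column names differ in: ",
         paste(basename(paths[!same_cols]), collapse = ", "))
  }
  do.call(rbind, tables)
}

# Two small temporary files stand in for the real data files
tmp_dir <- tempfile("csv_demo")
dir.create(tmp_dir)
write.csv(data.frame(PA = "Ang Mo Kio", Pop = 140, Time = 2000),
          file.path(tmp_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(PA = "Yishun", Pop = 70, Time = 2023),
          file.path(tmp_dir, "b.csv"), row.names = FALSE)

demo_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
combined_demo <- read_and_combine(demo_files)
```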

Once the data has been combined, we inspect it with the head() and tail() functions to understand its structure and spot any missing values or inconsistencies.

# Display the first few rows of the combined data frame
head(df)
          PA        SZ Age     Sex Pop Time
1 Ang Mo Kio Cheng San   0   Males 140 2000
2 Ang Mo Kio Cheng San   0 Females 130 2000
3 Ang Mo Kio Cheng San   1   Males 180 2000
4 Ang Mo Kio Cheng San   1 Females 140 2000
5 Ang Mo Kio Cheng San   2   Males 160 2000
6 Ang Mo Kio Cheng San   2 Females 130 2000
# Display the last few rows of the combined data frame
tail(df)
            PA          SZ         Age     Sex Pop Time
1393751 Yishun Yishun West          88   Males  40 2023
1393752 Yishun Yishun West          88 Females  70 2023
1393753 Yishun Yishun West          89   Males  40 2023
1393754 Yishun Yishun West          89 Females  40 2023
1393755 Yishun Yishun West 90_and_Over   Males  70 2023
1393756 Yishun Yishun West 90_and_Over Females 200 2023

We can check the total number of rows in the data frame.

# Display the total number of rows in the data frame
nrow(df)
[1] 1393756
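
head() and tail() only show the two ends of the table. A quick sketch of a fuller check is given below: it counts missing values per column and lists Age values that are not plain numbers. The helper name check_quality and the toy rows are assumptions for illustration.

```r
# Hypothetical helper: summarise missing values and non-numeric Age entries
check_quality <- function(data) {
  age_chr <- as.character(data$Age)
  list(
    missing_per_column = colSums(is.na(data)),
    non_numeric_age = unique(age_chr[!grepl("^[0-9]+$", age_chr)])
  )
}

# Toy rows in the same layout as the combined data frame
toy <- data.frame(
  PA = c("Ang Mo Kio", "Yishun", "Yishun"),
  Age = c("0", "89", "90_and_Over"),
  Pop = c(140, NA, 200),
  Time = c(2000, 2023, 2023)
)

quality <- check_quality(toy)
quality
```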

Based on the data type and structure, we will remove the columns that are not relevant to our visualization and clean the remaining columns to ensure consistency and accuracy. We will also aggregate the data to create a new data frame that consolidates the total population by age group, planning area, and time period.

# Deleting "SZ" and "Sex" columns as they are not relevant to our visualization
df <- df %>% select(-"SZ", -"Sex")
# Function to map single years of age (0, 1, 2, ...) to 10-year groups (0 to 9, 10 to 19, ...)
create_age_group <- function(age) {
  if (age == "90_and_over" || age == "90_and_Over") {
    return("90 and Over")
  } else {
    age_num <- as.numeric(age)
    group_start <- (age_num %/% 10) * 10
    group_end <- group_start + 9
    return(paste0(group_start, " to ", group_end))
  }
}
# Mutating the age column into a new AgeGroup column
df <- df %>% mutate(AgeGroup = sapply(Age, create_age_group))

head(df)
          PA Age Pop Time AgeGroup
1 Ang Mo Kio   0 140 2000   0 to 9
2 Ang Mo Kio   0 130 2000   0 to 9
3 Ang Mo Kio   1 180 2000   0 to 9
4 Ang Mo Kio   1 140 2000   0 to 9
5 Ang Mo Kio   2 160 2000   0 to 9
6 Ang Mo Kio   2 130 2000   0 to 9
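
The sapply() call above runs the grouping function once per row, which can be slow on more than a million rows. A vectorised sketch of the same grouping is shown below; the function name create_age_group_vec is hypothetical, and the logic is intended to mirror create_age_group() rather than replace it.

```r
# Hypothetical vectorised variant of create_age_group():
# works on the whole Age column in one call.
create_age_group_vec <- function(age) {
  age <- as.character(age)
  # Match the open-ended top group regardless of letter case
  is_top <- tolower(age) == "90_and_over"
  # Non-numeric labels become NA here; they are handled by is_top below
  age_num <- suppressWarnings(as.integer(age))
  group_start <- (age_num %/% 10L) * 10L
  ifelse(is_top,
         "90 and Over",
         paste0(group_start, " to ", group_start + 9L))
}

groups_demo <- create_age_group_vec(c("0", "9", "10", "89", "90_and_Over"))
groups_demo
```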

Next, we will clean the ‘Pop’ column by removing any non-numeric characters and converting it to a numeric data type for aggregation. This also makes the data frame more memory efficient, since storing these values as character strings can use a lot of memory and slow down the workflow. One of the original data sets, the data gathered in 2000, contains a group of residents whose Planning Area was not stated. We will remove these rows as they are not relevant to our visualization. Afterwards, we will aggregate the data to calculate the total population by age group, planning area, and time period.

# Function to clean the 'Pop' column by removing non-numeric characters
clean_pop <- function(pop) {
  # Remove any non-digit characters
  clean_pop <- gsub("[^0-9]", "", pop)
  return(clean_pop)
}
# Function to drop any row containing "Not Stated" in the PA column as this is not relevant to our visualization
drop_not_stated <- function(df, column_name = "PA") {
  # Filter out rows where the specified column contains "Not Stated"
  df <- df %>% filter(!(.data[[column_name]] == "Not Stated"))
  return(df)
}
# Apply the drop_not_stated function to the dataframe
df <- drop_not_stated(df, "PA")
# Apply the cleaning function to the 'Pop' column (gsub() is vectorised, so no sapply() is needed)
df$Pop <- clean_pop(df$Pop)
# Convert Pop to numeric for aggregation
df$Pop <- as.numeric(df$Pop)
# Move columns to a new data frame to consolidate total population, Age Group and Time
aggregated_df <- df %>%
  group_by(PA, AgeGroup, Time) %>%
  summarise(TotalPop = sum(Pop, na.rm = TRUE))
`summarise()` has grouped output by 'PA', 'AgeGroup'. You can override using
the `.groups` argument.
head(aggregated_df)
# A tibble: 6 × 4
# Groups:   PA, AgeGroup [1]
  PA         AgeGroup  Time TotalPop
  <chr>      <chr>    <int>    <dbl>
1 Ang Mo Kio 0 to 9    2000    21160
2 Ang Mo Kio 0 to 9    2001    19490
3 Ang Mo Kio 0 to 9    2002    18490
4 Ang Mo Kio 0 to 9    2003    17770
5 Ang Mo Kio 0 to 9    2004    17080
6 Ang Mo Kio 0 to 9    2005    16550
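
The message printed after summarise() above notes that the result is still grouped by PA and AgeGroup. A small sketch of the same cleaning and aggregation steps on toy rows, with .groups = "drop" so that an ungrouped table is returned, is given below. The toy values are made up for illustration.

```r
library(dplyr)

# Made-up rows: one value with a thousands separator, one "Not Stated"
# planning area, and one non-numeric placeholder
toy <- tibble(
  PA = c("Ang Mo Kio", "Ang Mo Kio", "Not Stated", "Bedok"),
  AgeGroup = "0 to 9",
  Time = 2000L,
  Pop = c("140", "1,300", "50", "-")
)

toy_total <- toy %>%
  filter(PA != "Not Stated") %>%
  # Strip non-digits; an empty string becomes NA after as.numeric()
  mutate(Pop = as.numeric(gsub("[^0-9]", "", Pop))) %>%
  group_by(PA, AgeGroup, Time) %>%
  # .groups = "drop" returns an ungrouped tibble and silences the message
  summarise(TotalPop = sum(Pop, na.rm = TRUE), .groups = "drop")

toy_total
```

Note that a group containing only missing values sums to 0 with na.rm = TRUE, so it may be worth checking for such groups before plotting.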

We can save the cleaned data to a new CSV file for future use.

# Saved for future use
write_csv(aggregated_df, "cleaned_data.csv")
library(sf)
Warning: package 'sf' was built under R version 4.4.1
Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(tmap)
Warning: package 'tmap' was built under R version 4.4.1
Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
remotes::install_github('r-tmap/tmap')
# Read the KML file
singapore_kml <- st_read("singapore.kml")
Reading layer `ELD2020' from data source 
  `C:\Users\aloys\OneDrive\Documents\Year 1 Tri 3\AAI 1001 Data Engineering and Visualization\DEnVGrp3Proj\singapore.kml' 
  using driver `KML'
Simple feature collection with 31 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158762 xmax: 104.0885 ymax: 1.470783
Geodetic CRS:  WGS 84
# Inspect the KML file structure
str(singapore_kml)
Classes 'sf' and 'data.frame':  31 obs. of  3 variables:
 $ Name       : chr  "RADIN MAS" "MOUNTBATTEN" "TANJONG PAGAR" "JALAN BESAR" ...
 $ Description: chr  "" "" "" "" ...
 $ geometry   :sfc_MULTIPOLYGON of length 31; first list element: List of 1
  ..$ :List of 1
  .. ..$ : num [1:131, 1:2] 104 104 104 104 104 ...
  ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
 - attr(*, "sf_column")= chr "geometry"
 - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA
  ..- attr(*, "names")= chr [1:2] "Name" "Description"
head(singapore_kml)
Simple feature collection with 6 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6911 ymin: 1.260668 xmax: 103.9203 ymax: 1.344064
Geodetic CRS:  WGS 84
           Name Description                       geometry
1     RADIN MAS             MULTIPOLYGON (((103.8248 1....
2   MOUNTBATTEN             MULTIPOLYGON (((103.9203 1....
3 TANJONG PAGAR             MULTIPOLYGON (((103.8458 1....
4   JALAN BESAR             MULTIPOLYGON (((103.8738 1....
5    MACPHERSON             MULTIPOLYGON (((103.8818 1....
6       PIONEER             MULTIPOLYGON (((103.7083 1....
# Set tmap mode to "view" for interactive maps
tmap_mode("view")
tmap mode set to interactive viewing
tmap_options(check.and.fix = TRUE)
tmap_options(max.categories = 31)

# Create the interactive map
tm_shape(singapore_kml) +
  tm_borders("blue", lwd = 1) +
  tm_fill(col = "Name", palette = "Set3", alpha = 0.5) + # Fill each polygon by its Name attribute
  tm_text("Name", size = 0.7) + # Label each polygon with its Name attribute
  tm_view(bbox = st_bbox(singapore_kml)) # Zoom to the extent of the Singapore data
Warning: The shape singapore_kml is invalid. See sf::st_is_valid
singapore_kml
Simple feature collection with 31 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158762 xmax: 104.0885 ymax: 1.470783
Geodetic CRS:  WGS 84
First 10 features:
            Name Description                       geometry
1      RADIN MAS             MULTIPOLYGON (((103.8248 1....
2    MOUNTBATTEN             MULTIPOLYGON (((103.9203 1....
3  TANJONG PAGAR             MULTIPOLYGON (((103.8458 1....
4    JALAN BESAR             MULTIPOLYGON (((103.8738 1....
5     MACPHERSON             MULTIPOLYGON (((103.8818 1....
6        PIONEER             MULTIPOLYGON (((103.7083 1....
7   POTONG PASIR             MULTIPOLYGON (((103.889 1.3...
8          YUHUA             MULTIPOLYGON (((103.7373 1....
9    BUKIT BATOK             MULTIPOLYGON (((103.7484 1....
10        JURONG             MULTIPOLYGON (((103.7373 1....
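
The tmap warning above reports that the shape is invalid and points to sf::st_is_valid. As a minimal sketch of what that check and a repair look like, the example below uses a made-up self-intersecting polygon rather than the KML data.

```r
library(sf)

# A "bow-tie" ring crosses itself, which makes the polygon invalid
bow_tie <- st_polygon(list(rbind(
  c(0, 0), c(1, 1), c(1, 0), c(0, 1), c(0, 0)
)))
shapes <- st_sfc(bow_tie)

# Check validity, repair the geometry, then check again
valid_before <- st_is_valid(shapes)
repaired <- st_make_valid(shapes)
valid_after <- st_is_valid(repaired)

valid_before
valid_after
```

Calling st_is_valid() with reason = TRUE returns a text description of each problem instead of a logical value, which can help locate the offending polygons.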
# Load the libraries
library(sf)
library(ggplot2)

# Define the path to your KML file (ensure this path is correct)
kml_file_path <- "singapore2.kml"

# Read the KML file
singapore_map <- st_read(kml_file_path)
Reading layer `singapore_Division_level_2' from data source 
  `C:\Users\aloys\OneDrive\Documents\Year 1 Tri 3\AAI 1001 Data Engineering and Visualization\DEnVGrp3Proj\singapore2.kml' 
  using driver `KML'
Simple feature collection with 10 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6811 ymin: 1.254808 xmax: 104.0336 ymax: 1.416214
Geodetic CRS:  WGS 84
# Inspect the KML data to understand its structure
print(head(singapore_map))
Simple feature collection with 6 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6811 ymin: 1.284839 xmax: 104.0336 ymax: 1.416214
Geodetic CRS:  WGS 84
  Name Description                       geometry
1                  MULTIPOLYGON (((103.9321 1....
2                  MULTIPOLYGON (((103.8188 1....
3                  MULTIPOLYGON (((103.9838 1....
4                  MULTIPOLYGON (((103.8581 1....
5                  MULTIPOLYGON (((103.6973 1....
6                  MULTIPOLYGON (((103.9067 1....
# Plot the map with ggplot2
ggplot(data = singapore_map) +
  geom_sf() +
  theme_minimal() +
  labs(title = "Planning Areas in Singapore",
       x = "Longitude",
       y = "Latitude")

# Load the libraries
library(leaflet)
Warning: package 'leaflet' was built under R version 4.4.1
library(sf)
library(lwgeom)
Warning: package 'lwgeom' was built under R version 4.4.1
Linking to liblwgeom 3.0.0beta1 r16016, GEOS 3.12.1, PROJ 9.3.1

Attaching package: 'lwgeom'
The following object is masked from 'package:sf':

    st_perimeter
# Define the path to your KML file (ensure this path is correct)
kml_file_path <- "singapore.kml"

# Read the KML file
singapore_map <- st_read(kml_file_path)
Reading layer `ELD2020' from data source 
  `C:\Users\aloys\OneDrive\Documents\Year 1 Tri 3\AAI 1001 Data Engineering and Visualization\DEnVGrp3Proj\singapore.kml' 
  using driver `KML'
Simple feature collection with 31 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158762 xmax: 104.0885 ymax: 1.470783
Geodetic CRS:  WGS 84
# Clean and validate the geometries
singapore_map_clean <- st_make_valid(singapore_map)

# Optionally, check for and remove any empty geometries
singapore_map_clean <- singapore_map_clean[!st_is_empty(singapore_map_clean),]

# Inspect the cleaned KML data to ensure it's valid
print(head(singapore_map_clean))
Simple feature collection with 6 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6911 ymin: 1.260668 xmax: 103.9203 ymax: 1.344064
Geodetic CRS:  WGS 84
           Name Description                       geometry
1     RADIN MAS             MULTIPOLYGON (((103.8248 1....
2   MOUNTBATTEN             MULTIPOLYGON (((103.9203 1....
3 TANJONG PAGAR             MULTIPOLYGON (((103.8458 1....
4   JALAN BESAR             MULTIPOLYGON (((103.8738 1....
5    MACPHERSON             MULTIPOLYGON (((103.8818 1....
6       PIONEER             MULTIPOLYGON (((103.7083 1....
# Create an interactive map with leaflet
# Polygon areas as plain numbers (st_area() returns a units object)
area_values <- as.numeric(st_area(singapore_map_clean))
# Build the palette once so the fill and the legend share the same quantile breaks
area_pal <- colorQuantile("YlGnBu", domain = area_values)

leaflet(singapore_map_clean) %>%
  addTiles() %>%
  addPolygons(
    color = "#444444",
    weight = 1,
    smoothFactor = 0.5,
    opacity = 1.0,
    fillOpacity = 0.5,
    fillColor = area_pal(area_values),
    highlightOptions = highlightOptions(
      color = "white",
      weight = 2,
      bringToFront = TRUE
    ),
    # The ELD2020 layer holds electoral divisions, not planning areas
    label = ~paste("Electoral Division:", Name)
  ) %>%
  addLegend(
    position = "bottomright",
    pal = area_pal,
    values = area_values,
    title = "Area"
  )
# Load the libraries
library(sf)
library(lwgeom)
library(plotly)
Warning: package 'plotly' was built under R version 4.4.1

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
# Define the path to your KML file (ensure this path is correct)
kml_file_path <- "singapore.kml"

# Read the KML file
singapore_map <- st_read(kml_file_path)
Reading layer `ELD2020' from data source 
  `C:\Users\aloys\OneDrive\Documents\Year 1 Tri 3\AAI 1001 Data Engineering and Visualization\DEnVGrp3Proj\singapore.kml' 
  using driver `KML'
Simple feature collection with 31 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158762 xmax: 104.0885 ymax: 1.470783
Geodetic CRS:  WGS 84
# Clean and validate the geometries
singapore_map_clean <- st_make_valid(singapore_map)

# Optionally, check for and remove any empty geometries
singapore_map_clean <- singapore_map_clean[!st_is_empty(singapore_map_clean),]

# Inspect the cleaned KML data to ensure it's valid
print(head(singapore_map_clean))
Simple feature collection with 6 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6911 ymin: 1.260668 xmax: 103.9203 ymax: 1.344064
Geodetic CRS:  WGS 84
           Name Description                       geometry
1     RADIN MAS             MULTIPOLYGON (((103.8248 1....
2   MOUNTBATTEN             MULTIPOLYGON (((103.9203 1....
3 TANJONG PAGAR             MULTIPOLYGON (((103.8458 1....
4   JALAN BESAR             MULTIPOLYGON (((103.8738 1....
5    MACPHERSON             MULTIPOLYGON (((103.8818 1....
6       PIONEER             MULTIPOLYGON (((103.7083 1....
# Create a Plotly map
# Store polygon areas as a plain numeric column so plotly can map them to colour
singapore_map_clean$area_m2 <- as.numeric(st_area(singapore_map_clean))

# plot_ly() draws sf objects directly; aesthetics are passed as formulas, not via aes().
# The mapbox layout settings were dropped because they only apply to plot_mapbox().
plot_ly(
  singapore_map_clean,
  split = ~Name,      # one trace per polygon so each gets its own hover label
  color = ~area_m2,   # a numeric colour mapping produces a colour bar
  # The ELD2020 layer holds electoral divisions, not planning areas
  text = ~paste("Electoral Division:", Name),
  hoveron = "fills",
  hoverinfo = "text",
  showlegend = FALSE
) %>%
  layout(margin = list(r = 0, l = 0, t = 0, b = 0)) %>%
  colorbar(title = "Area (m^2)", len = 0.5, y = 0.5)

Next, we load the cleaned_data.csv file saved earlier to confirm that it reads back correctly.

# Load the cleaned data
cleaned_data <- read_csv("cleaned_data.csv")
Rows: 13190 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): PA, AgeGroup
dbl (2): Time, TotalPop

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Display the first few rows of the cleaned data
head(cleaned_data)
# A tibble: 6 × 4
  PA         AgeGroup  Time TotalPop
  <chr>      <chr>    <dbl>    <dbl>
1 Ang Mo Kio 0 to 9    2000    21160
2 Ang Mo Kio 0 to 9    2001    19490
3 Ang Mo Kio 0 to 9    2002    18490
4 Ang Mo Kio 0 to 9    2003    17770
5 Ang Mo Kio 0 to 9    2004    17080
6 Ang Mo Kio 0 to 9    2005    16550
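
To color a map by population, the PA values in the cleaned data have to match the Name values in the map layer. The outputs above show PA in title case (e.g. Ang Mo Kio) and Name in upper case (e.g. RADIN MAS), so a join would need a normalised key. A minimal sketch follows, using only names that appear in the outputs above; the helper normalise_name is hypothetical.

```r
# Names copied from the outputs above (not the full lists)
pa_names <- c("Ang Mo Kio", "Yishun")
map_names <- c("RADIN MAS", "MOUNTBATTEN", "TANJONG PAGAR",
               "JALAN BESAR", "MACPHERSON", "PIONEER")

# Hypothetical helper: trim whitespace and upper-case a name for matching
normalise_name <- function(x) toupper(trimws(x))

# Planning areas that have no polygon with the same normalised name
unmatched <- setdiff(normalise_name(pa_names), normalise_name(map_names))
unmatched
```

Both sample planning areas remain unmatched here, which suggests checking whether the map layer actually contains planning-area boundaries before joining.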

5 Conclusion

The data is now ready for visualization. The next step will be to create a plot that effectively communicates how population density in each region of Singapore changes over time, and that lets curious readers explore the data further through interactivity. We will use the ggplot2 package to create the plot and plotly to add interactivity.
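
As a rough sketch of that next step, the example below plots the six Ang Mo Kio rows shown in the head(cleaned_data) output and passes the plot to ggplotly() for hover interactivity. The chart type and labels are assumptions for illustration, not the final design.

```r
library(ggplot2)
library(plotly)

# Six rows copied from the head(cleaned_data) output
amk <- data.frame(
  PA = "Ang Mo Kio",
  AgeGroup = "0 to 9",
  Time = 2000:2005,
  TotalPop = c(21160, 19490, 18490, 17770, 17080, 16550)
)

# Static line chart of the population trend
trend_plot <- ggplot(amk, aes(x = Time, y = TotalPop)) +
  geom_line() +
  geom_point() +
  labs(title = "Ang Mo Kio residents aged 0 to 9",
       x = "Year",
       y = "Total population") +
  theme_minimal()

# Convert the ggplot object into an interactive plotly widget
interactive_plot <- ggplotly(trend_plot)
```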

6 References

  1. Arnold, M., Goldschmitt, M., & Rigotti, T. (2023, June 21). Dealing with information overload: A comprehensive review. Frontiers in Psychology. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10322198/

  2. Okabe, M., & Ito, K. (2008). Color Universal Design (CUD): How to make figures and presentations that are friendly to colorblind people. https://jfly.uni-koeln.de/color/

  3. Singapore Department of Statistics. (2023). Population Trends 2023. https://www.singstat.gov.sg/-/media/files/publications/population/population2023.ashx

  4. Singapore Department of Statistics. (2000-2023). Singapore Residents by Planning Area / Subzone, Single Year of Age and Sex (June 2000-2010, June 2011-2020, June 2021, June 2022, June 2023) [Data set]. https://www.singstat.gov.sg/find-data/search-by-theme/population/geographic-distribution/latest-data